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Abstract 

The paper presents investigations on the implementation and perfor¬ 
mance of the finite element numerical integration algorithm for first order 
approximations and three processor architectures, popular in scientific 
computing, classical CPU, Intel Xeon Phi and NVIDIA Kepler GPU. A 
unifying programming model and portable OpenCL implementation is 
considered for all architectures. Variations of the algorithm due to dif¬ 
ferent problems solved and different element types are investigated and 
several optimizations aimed at proper optimization and mapping of the al¬ 
gorithm to computer architectures are demonstrated. Performance models 
of execution are developed for different processors and tested in practical 
experiments. The results show the varying levels of performance for differ¬ 
ent architectures, but indicate that the algorithm can be effectively ported 
to all of them. The general conclusion is that the finite element numeri¬ 
cal integration can achieve sufficient performance on different multi- and 
many-core architectures and should not become a performance bottleneck 
for finite element simulation codes. 


1 Introduction 

Finite element numerical integration forms an indispensable part of practically 
all complex finite element codes. Apart from special formulations for specific 
problems (such as e.g. applications in structural mechanics for beams, plates 
etc.) where it can be avoided using some analytically obtained formulae, it is 
universally used for creating entries to finite element stiffness matrices and load 
vectors (matrices and right hand side vectors for systems of linear equations). 
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The recent progress in hardware for numerical computations caused the 
widespread use of multi-core and massively multi-core (often termed many- 
core) processors with vector processing capabilities. These created the necessity 
to reconsider the implementations of scientific codes, taking into account the 
possible options for multi-threading and the changing execution environment 
characteristics. 

The aim of the research presented in the current paper is to thoroughly 
analyse the process of finite element numerical integration for first order ap¬ 
proximations, in a possibly broad context, and to propose a set of guidelines for 
designing efficient integration procedures for multi-core processors, taking into 
account SIMD modes of execution. The paper shows also some variants of the 
algorithm, that are implemented for different hardware platforms and tested in 
practice. 

The paper is organised as follows. First, finite element numerical integration 
is described in the form investigated in our work. Then, in Section 3 its imple¬ 
mentation, in the context of finite element simulations and for different types 
of problems, elements and computing platforms is analysed. Several variants of 
the procedure are developed, and, in the next section, their optimization and 
mapping to different processor architectures is discussed. Section 5 presents the 
results of computational experiments that test the performance of procedures 
with short discussion of results. The last two sections briefly document other 
works related to the subject presented in the current paper and present the final 
conclusions. 


2 Finite element numerical integration 

2.1 Finite element weak statements and systems of linear 
equations 

In the current paper, we consider the finite element method as a tool for finding 
approximate solutions to partial differential equations, specified in a computa¬ 
tional domain with suitable boundary conditions imposed on 912, based on 
weak formulations that can be expressed (with some details omitted) as follows 

m- 

Find an unknown function u belonging to a specified function space 
such that 

/ -|- c^'^w^iU + c^’^wu^i + c^'^wu) dil -I- BTL = (1) 

Jn ' ’ 

= I {d°w + d^w^i)dn + BTR 
Jn 

for every test function w defined in a space Vq. 

Above, c*’-^ and d*, i,j = 0 ,..,Nd denote the coefficients specific to the 
weak formulation (with Njj being the number of physical space dimensions). 
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”denotes differentiation with respect to space coordinates and summation 
convention for repeated indices is used. BTL and BTR denote boundary terms, 
usually associated with integrals over the boundary The task of calculating 
boundary terms is usually much less computationally demanding, and we neglect 
it in the current paper, when discussing implementations for multi- and many- 
core architectures. In the actual computations described later in the paper this 
task is performed by the part of the code executed by CPU cores, using standard 
finite element techniques. 

There are many possible finite element approximation spaces that can be 
defined for triangulations of U into a set of finite elements fie, e = 1,...,Ne 
In the current paper we concentrate on scalar problems and the popular case 
of first order approximations, i.e. spaces spanned by elementwise linear (or 
multilinear) basis functions, associated with element vertices. Vertices with the 
associated basis functions form the nodes of the finite element mesh, with their 
number denoted by Piecewise linear basis functions are defined for simplex 
elements (triangles in 2D and tetrahedra in 3D), multilinear basis functions are 
specified for other types of elements (quadrilateral, prismatic, hexahedral, etc.) 

After representing unknown and test functions as linear combinations of 
basis functions , r = 1, the system © is transformed to a set of linear 

equations for the unknown vector of degrees of freedom U : 

Nn 

= ¥ r = l,..,NN ( 2 ) 

S=1 

with the entries in the global system matrix A (the stiffness matrix) computed 
as (with boundary terms neglected): 


^ +E E +c°’°rrj an (3) 

The entries to the global right hand side vector b (the load vector) are 
calculated according to the formula (again with boundary terms neglected): 


b^ = 



(4) 


The solution of a single problem ([T]) (or equivalently the system ([2])) may 
constitute the only step of the finite element solution process or may form 
a single stage of more complex solution strategies, e.g. for time dependent 
and/or nonlinear problems. In the latter case the efficient accomplishment of 
numerical integration becomes even more important for the overall performance 
of simulation programs. 


2.2 Finite element solution procedures 

The solution of a single finite element problem represented by o or @ requires 
the creation of global stiffness matrix and load vector entries, followed by their 
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use in obtaining the final solution of the problem. A single entry in the global 
stiffness matrix or load vector consists of an integral over the whole computa¬ 
tional domain plus some boundary terms. The integral over the computational 
domain, on which we concentrate in the current paper, can be expressed as the 
sum of contributions from individual elements: 



(5) 


The entries can be either computed individually, in a double loop over all 
entries of the global stiffness matrix, or can be computed in a loop over all 
elements, with all the entries related to a given element computed at once. 
We consider in the current paper only the latter strategy. Compared to the 
first strategy where each finite element is visited several times, this saves some 
element related calculations but leaves as a result element contributions that 
have to be further processed. Our choice is motivated by the fact that for 
many types of problems, including nonlinear, the overhead of repeated element 
calculations may become too high. In our approach we do not decide, however, 
how element contributions are used further in the solution procedure. Different 
options, with the assembly of the global stiffness matrix [7] or without assembly, 
in the so called matrix-free approaches [T], can be applied. 

The element contributions to the global stiffness matrix form a small dense 
matrix A®, with each entry corresponding to a pair of element shape functions 
(f) (that are used for creation of specific global basis functions). The entries of 
the element stiffness matrix (with boundary terms neglected) are given by the 
formula: 



(A«) 


where now indices r and s are local indices, with the range from 1 to Ns - the 
number of shape functions for a given element. 

The element right hand side vector, that forms the element contribution 
to the global right hand side vector, is computed based on the formula (with 
boundary terms neglected): 



2.3 A generic finite element numerical integration algo¬ 


rithm 


We consider in our paper the creation of element stiffness matrices and load 
vectors using numerical integration and transformations to reference elements. 
Both procedures are related, since quadratures for numerical integration are 
usually defined for reference elements. 
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Shape functions are defined for a reference element of a given type of geom¬ 
etry and approximation. Each real element in the finite element mesh is treated 
as an image of the associated reference element under a suitable geometric trans¬ 
formation. Denoting the physical coordinates of points in the real element by 
X, the transformation (defined separately for each element) from the reference 
space with coordinates ^ is expressed as £c(^) (with the inverse transformation 
given by ^{x)). 

The parameters of the transformation are used for calculations of global 
derivatives of shape functions that involve the elements of the Jacobian matrix 


of the inverse transformation ^{x) (the elements of Jacobian matrices 



and 1^1 with their determinants, are later on, when considering actual com¬ 
putations, termed together as Jacobian terms). 

The transformation from the reference element Cl to the real element Clg is 
used to perform the change of variables in integrals from ([B|) and O- Taking 
an example integral from ([5]) and applying the change of variables we obtain: 


EE< 




34 )'^ 90 * 

dxi dxn 


90 ’' d^k 90 * 34,2 




dxo 


detJdCl (8) 


where cjf denotes the shape functions defined for the reference element. 

The application of numerical integration quadratures (in practical examples 
we always use Gaussian quadratures) leads to the transformation (for brevity 
we use global derivatives of shape functions): 



,,,- 90 ’' 90 * 

3xi 3xj 


Nq 


E 

Q=1 



c ’ 


90 ’' 90 * 


dxi 3xi 


det J 


w 


Q 




(9) 


where 4^ denotes the coordinates of the subsequent integration points with 
the associated weights w^, Q = 1, ■■,Nq {Nq being the number of integration 
points related to the type of element, the degree of approximation and the form 
of coefficients c’d). 

The input for the generic procedure of numerical integration for a single real 
element, that implements ([HI) and 0, with (m, (|ni) and similar expressions for 
other terms taken into account, is given by: 


• the type of element, the type and the order of approximation 

• the set of local coordinates and weights for integration points 

• possibly the set of values of shape functions and their local derivatives at 
all integration points for the reference element (they can be also computed 
on the fly during integration) 

• the set of geometry degrees of freedom for the real element 
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• the set of problem dependent parameters that are used for computing 
coefficients c®’-^ and d® at integration points (for simple problems these 
may be the values of coefficients at integration points themselves) 

The output of the procedure is formed by the element stiffness matrix A® and 
the element load vector b®. Algorithm [T] shows a generic procedure for creating 
a sequence of element stiffness matrices and load vectors, assuming they have 
the same type and order of approximation, with the data for the respective 
reference element (integration data and shape functions data) assumed to be 
available in local data structures. 


Algorithm 1: A generic algorithm for finite element numerical integration for the 
sequence of elements of the same type and order of approximation 

1 for e = 1 to Ne do 

2 - read problem dependent parameters specific to the element 

3 - read geometry data for the element 

4 - initialize and 

5 for iq = 1 to Nq do 


compute necessary terms of Jacobian matrices ^ and 


and calculate 
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8 

9 

10 
11 
12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 end for 


the value of variable vol[iQ] = det(^) x w'^liq] 
for is = 1 to Ns do 

- using Jacobian terms calculate the values of global derivatives of shape 
functions at integration points, 4> 

end for 
end for 

for ig = 1 to Nq do 

I - calculate the actual coefficients c and d at integration points 

end for 

for iq = 1 to Nq do 
for is = 1 to Ns do 
for js = 1 to Ns do 
for in ~ 0 to No do 
for jo = 0 to No do 
A'®[is][is]+ = 

vol[iQ] xc[iD]bA][iQ] X </)[iD][is][iQ] x ct)[jD][js][iq] 

end for 
end for 
end for 

for io = 0 to No do 

I b®[is]+ = vol[iQ] xd[io][iq] x (f>[io\[is][iq] 

end for 
end for 
end for 

- store the whole A® and b® as output of the procedure 


The algorithm shows how numerical integration calculations can be per- 
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formed. As it can be seen, there are three subsequent, separate loops over 
integration points in the algorithm. They can be fused together, as it is often 
done in practice. The presented form of Algorithm [T] has been chosen to show 
that, with the suitable arrangement of calculation stages, the final computations 
of stiffness matrix entries are performed in five nested loops, with the order of 
loops being arbitrary. Each order can, however, lead to different execution per¬ 
formance, that additionally depends on the hardware on which the code runs. 
This opens multiple ways for optimizing the execution characteristics of finite 
element numerical integration that we investigate in the following sections. 


3 Implementation of numerical integration 

3.1 The place of numerical integration within the finite 
element solution procedure 

We treat the procedure of numerical integration as one of computational kernels 
of finite element codes. By this we want to stress its importance for finite element 
computations and its influence on the performance of codes. To further clarify 
its place in finite element programs we describe a way in which, in the setting 
assumed in the paper, it interacts with the rest of the code. 

We consider the implementation of finite element numerical integration in 
codes designed for large scale calculations. The number of elements in such 
calculations often exceeds million and even can reach several billions. The only 
way to ensure the scalable execution of programs for that size of problems is to 
consider distributed memory systems and message passing. The standard way 
of implementation in this case is some form of domain decomposition approach. 
In the current paper we assume the design of finite element codes, in which 
the issues of message passing parallelization and multi-core acceleration are 
separated [2] . In this design multithreading concerns only single node execution 
and a separate layer of software combines single node components into a whole 
parallel code. 

As a result we consider the algorithm of numerical integration executed for all 
elements of a given subdomain. We assume the size of the subdomain to exceed 
one hundred thousand of elements, the size that justifies the use of massively 
multi-core accelerators. We allow for complex mesh and field data structures 
related to adaptive capabilities, many types of elements, several approximation 
fields and different approximation spaces in a single simulation. Therefore we 
assume that the finite element data concerning elements and approximation 
fields are stored in standard global memory accessible to CPUs. This memory 
should ensure the sufficient storage capacity and a latency minimizing way of 
accessing data. 

Hence, the input data to numerical integration comes from finite element 
data structures stored in CPU memory. The output of numerical integration 
is formed by element stiffness matrices and load vectors. In order to allow 
for broad range of scenarios of using the computed entries (standard direct. 
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frontal, iterative solvers, with or without matrix assembly) we assume that the 
output arrays after being computed, can be stored in some memory, available 
to threads performing subsequent solution steps (this can even include registers 
if e.g. assembly operation is designed as extension of procedures performing 
numerical integration). Hence, we discuss the problem of effective storing of 
element arrays in memory of the processor performing integration, but consider 
also the case, where the data are not stored at all. In both cases the further 
use of element stiffness matrices and load vectors depends on implementation 
details of possible assembly and solver procedures. 

3.2 Parallelization strategy 

As can be seen in Algorithm [U numerical integration in the form considered 
in the current paper, consists of several loops. Each of these loops can be 
parallelized, leading to different parallel algorithms with different optimization 
options and performance characteristics. 

For the analysis in the current paper we consider parallelization of only one 
loop - the loop over finite elements. Due to the sufficiently large number of 
elements in the loop (as assumed for message passing implementation of large 
scale problems) and the embarrassingly parallel character of the loop this choice 
seem to be obvious. By analysing it in details, we want to investigate its opti¬ 
mization and performance characteristics, but also to show possible limitations 
that can lead to different parallelization strategies. 

Numerical integration is an embarrassingly parallel algorithm, but the later 
stages, assembly and linear system solution, are not. To allow for efficient 
execution of these stages, further assumptions are made. As was described 
when developing Algorithm [T1 elements of a single subdomain are divided into 
sets of elements of the same type and degree of approximation (to made them 
use the same reference element shape functions and integration quadratures). In 
order to allow for parallel assembly, the elements are further divided into subsets 
with different colors assigned. Elements of a single color do not contribute to 
the same entries in the global stiffness matrix and their stiffness matrices can 
be easily assembled into the global stiffness matrix in a fully parallel way, with 
no data dependencies. 

3.3 Variations of the algorithm 

Despite the fact that finite element numerical integration can be expressed by 
two simple formulas ([H]) and o, there exist many variations of the algorithm 
(even when considering only first order approximations), that should be taken 
into account when designing computer implementations. 

First, there are different types of problems solved, scalar and vector, linear 
and non-linear. Depending on the weak statement of the corresponding problem, 
the set of PDE coefficients can be either sent as input to the procedure or 
the input to the procedure can form just the arguments of a special, problem 
dependent function, that calculates the final coefficients. In the current paper 


we consider the first option, with the assumption that the coefficients for element 
stiffness matrices are constant over the whole finite element. 

The coefficient matrices for different problems (matrices c in Algorithm [1} 
can have varying non-zero structure. To make our implementations, to certain 
extent, independent of these variations, we consider the organization of compu¬ 
tations with the final calculations of updates to the stiffness matrix (two loops 
over space dimensions id and jo ~ lines 17-22 in Algorithm[T]) always performed 
inside all other loops. With such a solution, these final calculations are left as 
problem dependent and should be optimized separately, for each problem, either 
manually or using some automatic tools [5S]. 

Another variation is related to different geometrical types of finite elements: 
linear, multilinear and geometrically higher order (e.g. isoparametric for higher 
order approximations). The main difference, from the point of view of numerical 
integration, is between linear simplex elements, for which Jacobian terms are 
constant over the whole element, and all other types for which Jacobian terms 
have to be computed separately for each integration point. In our study we 
compare two types of elements: geometrically linear tetrahedral elements and 
geometrically multi-linear prismatic elements (both are elements without curved 
boundaries, with geometry degrees of freedom given by vertices coordinates). 

For the two different types of geometry of elements we consider different 
variations of numerical integration for first order approximations. We explicitly 
take into account the fact that for geometrically linear elements the derivatives 
of geometric shape functions and solution shape functions are constant over each 
element and that the corresponding calculations can be moved out of the loop 
over integration points. 

Finally, we consider several options for arranging the order of calculations 
in Algorithm[TJ For the three loops, over integration points (fg) and twice over 
shape functions {is and js), we consider the different orderings and denote them 
by QSS, SQS and SSQ (the symbol reflects the order implied by the subscripts of 
the loop indices). We assume that there are no separate loops for computing the 
additional quantities, Jacobian terms and global derivatives of shape functions, 
but that the loops are fused together. 

Combining together the options discussed above we present six algorithms 
da s nisi El III) for the final calculations of entries to A® (and 6® as well). The 
algorithms correspond to: 

• different loop orderings - QSS, SQS and SSQ versions 

• different types of elements - geoJinear for geometrically linear elements 
and geo-generic for all other types (it can be used for geometrically linear 
elements, but is less efficient) 

• different types of problems - through the delegation of final updates to 
problem specific calculations. 

In the algorithms it is assumed that the values of Jacobian terms and the 
global derivatives of all shape functions are calculated just after entering the 
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calculations for a given integration point. The global derivatives are stored in 
some temporary array and later used in calculations. In practical calculations, 
in order to save the storage for temporary arrays, another strategy can be used 
with the values of global derivatives for each shape function computed just at 
the beginning of the body of the loop for this shape function. The drawback of 
the approach is that the calculations in the innermost loop over shape functions 
are redundantly repeated for each shape function. 


Algorithm 2: The QSS-geo.generic version of numerical integration algorithm (ex¬ 
planation in the text) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 


for iq = 1 to Nq do 

calculate ^ and vol (and possibly coefficients c) 
for is = 1 to Ns do 

I calculate global derivatives of shape functions at the integration point 

end for 

for is = 1 to Ns do 
for js = 1 to Ns do 

I update A=[is][j5] using c[i£,lbD][*Q], </>[*D][fsl[*Q] and (l)[jD][js]liQ] 

end for 

update b®[is] using dyioWiq] and (/)[iD][is][*Ql 

end for 
end for 

store the whole A'^ and b® as output of the procedure 


Algorithm 3: The QSS-geoJinear version of numerical integration algorithm (ex¬ 
planation in the text) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 


calculate ^ and vol 
for is = 1 to Ns do 

I calculate global derivatives of shape functions 

end for 

for ig = 1 to Nq do 

(calculate possibly coefficients c) 
for is = 1 to Ns do 
for js = 1 to Ns do 

I update A®[is][js] using c\iD\[jD][iQ], (i>[iD][is\[iQ] and 4>[3D]\js][iQ\ 

end for 

update b®[is] using d[i_D][iQ] and temp[i_D] 

end for 
end for 

store the whole A® and b® as output of the procedure 


The six variants of numerical integration can be considered as the optimiza¬ 
tions of the algorithm, corresponding to classical techniques of loop interchange 
and loop invariant code motion. We perform them explicitly, to help compilers 
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Algorithm 4: The SQS.geo_generic version of numerical integration algorithm (ex¬ 
planation in the text) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 


for is = 1 to Ns do 
initialize row of A® 
for iq = 1 to Nq do 

calculate ^ and vol (and possibly coefficients c) 

calculate global derivatives of shape functions at the integration point 
for js = 1 to Ns do 

I update A^[is][js] using c\iD\[jD\[iQ], <l>[iD\[is\[iQ] and 0[jD][js][^Ql 

end for 

update b^[is] using d[iD][iQ\ and cl>[iD\[is][iQ] 

end for 

store row of A® and h^[is] in global memory 

end for 


Algorithm 5: The SQS-geodinear version of numerical integration algorithm (ex¬ 
planation in the text) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 


calculate ^ and vol 
for iq = 1 to Nq do 

I calculate global derivatives of shape functions 

end for 

for is = 1 to Ns do 
initialize row of A® 
for iq = 1 to Nq do 

(calculate possibly coefficients c) 
for js = I to Ns do 

I update using c\iD\[jD][iQ], (/>[*D][jsl[iQ] and 4>[3D][3s][iQ\ 

end for 

update b®[is] using d[iD][iQ] and <A[*D][is][*Q] 

end for 

store row of A® and in global memory 

end for 
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Algorithm 6: The SSQ-geo.generic version of numerical integration algorithm (ex¬ 
planation in the text) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 


for is = 1 to Ns do 
for js = I to Ns do 

A® [is] [is] = 0; b®[is] = 0 
for iq = 1 to Nq do 

calculate ^ and vol (and possibly coefficients c) 

calculate global derivatives of shape functions at the integration point 
update A®[is][is] using c\iD\[jD][iQ], (t>[iD][is\[iQ] and 4>[3D]\js\[iQ\ 
if is = js then 

I update b®[is] using djinWig] and </>[iD][is][iQ] 

end if 
end for 

store and h^[is] (if is = js) in global memory 

end for 
end for 


Algorithm 7: The SSQ-geoJinear version of numerical integration algorithm (ex¬ 
planation in the text) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 
17 


calculate ^ and vol 
for is = I to Ns do 

I calculate global derivatives of shape functions 

end for 

for is = 1 to Ns do 

for js = I to Ns do 

A^[is][js]=0- b^[is]=0 
for iq = 1 to Nq do 

(calculate possibly coefficients c) 

update A®[is][js] using c\iD][jD][iQ], (i>[iD][is][iQ] and 4>[jD\\js][iQ\ 
if is = js then 

I update b®[is] using djinWig] and </>[fD][is][*Q] 

end if 
end for 

store A®[is][js] and b®[is] (if is = js) in global memory 

end for 
end for 
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and to make the output for the same variant from different compilers similar. 
There are other different optimizations possible, such as common subexpression 
elimination, loop unrolling or induction variable simplihcation. In order to pre¬ 
serve the portability of the procedures for different processor architectures, we 
leave the application of these and other optimizations to the compiler. Instead, 
aiming particularly at optimization of the execution on graphics processors, we 
explicitly consider the placement of different arrays appearing in calculation in 
different levels of memory hierarchy. Before we present the details, we briefly 
introduce the programming model and example hardware platforms for which 
we design the implementations. 

3.4 Programming model 

We want to compare the optimizations and performance of execution for numer¬ 
ical integration on several types of hardware - multi-core CPUs, GPUs, Xeon 
Phi - but do not want to consider a radically different programming models. 
We choose a programming model that offers the explicit usage of shared mem¬ 
ory (necessary for GPUs), but that can be used also for architectures based on 
standard GPU cores (we include in this category also Xeon Phi architecture). 
The model has to allow for efficient vectorization, as well as other standard 
optimizations. 

The model that we choose is based on models developed for GPUs (CUBA 
[55] . OpenGL [2]), but tries to simplify them. The goal is to obtain the model 
simple enough, so that application programmers can use it, but having enough 
features to allow for high performance implementations on different contempo¬ 
rary hardware. Our final implementations are all done in OpenCL due to the 
fact of existing software development tools for all processors considered. 

We assume that the code is written in the SPMD (single program multiple 
data) fashion, for threads that execute only scalar operations. Several threads 
can be grouped together, so that hardware can run them in lockstep, creating 
vectorised parts of the code. It is up to the compiler and the hardware to per¬ 
form such vectorization. This vectorization is, in current programming environ¬ 
ments, hidden from the programmer, but cannot be neglected when analysing 
the performance of implementation. The vectorization is considered the stan¬ 
dard form of execution for GPUs and an important performance improvement 
for architectures with standard GPU cores. 

Because of the lockstep execution, we will call a group of threads perform¬ 
ing vectorised operations a SIMD group (this term directly corresponds to the 
notions of NVIDIA warps and AMD wavefronts). The number of threads in 
a SIMD group depends on scheduling and execution details of each processor 
architecture. For the contemporary GPUs it is usually equal to 32 or 64, while 
for the current CPU like cores it is related to the width of vector registers and 
can be equal to 4, 8 or ldil. 

^In the model, we adopt the point of view on execution on standard CPU cores taken from 
Intel OpenCL compiler m- In that perspective, a traditional CPU thread performing vector 
operations is seen as a SIMD group of threads, while for scalar operations, it is assumed that 
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One or several SIMD groups form a set of threads, that is called a thread- 
block in CUBA and a workgroup in OpenCL. We include that notion in the 
programming model and use the OpenCL word workgroup. Threads in a sin¬ 
gle workgroup are run on the same processor core (by the core we understand 
a standard CPU core, an NVIDIA streaming multiprocessor, an OpenCL com¬ 
puting unit). They share access to explicitly available fast memory (that we call 
further shared memory, adopting the CUBA convention, as opposed to using 
the OpenCL notion of local memory) and can be explicitly synchronised using 
the barrier construct. 

Threads from different workgroups are not synchronised and form a space 
of threads, executing concurrently a part of the code constituting a (computa¬ 
tional) kernel. We assume that the whole code, in our case a code performing 
finite element simulations, that contains complex and large data structures, is 
executed in the traditional way on standard CPUs with only several kernels 
ojfloaded to (massively) multi-core accelerators. 

The act of offloading is not a necessary step in executing codes in our 
model. For future architectures, having standard memories accessible to mas¬ 
sively multi-core processors with SIMD capabilities, it can be eliminated, with 
kernels denoting at that moment only carefully parallelized, vectorised and op¬ 
timized parts of codes. 

A programmer in our model can explicitly specify a storage location for 
selected kernel variables, as global or shared memory. In that case, specific, 
hardware dependent, considerations has to be performed to take into account, 
and possible optimize, different ways of accessing data in different levels of 
memory hierarchy. More specifically we assume that the accesses from different 
threads in the same SIMD group executing a single memory operation can be 
organized in such a way that they concern subsequent memory locations in 
memory. If two levels of memory are involved, the rule of subsequent accesses 
is considered more important for the slower memory. 

The above, simple, guideline should work well for GPUs, where it should 
allow for forming fast coalesced memory accesses to global memory [31 m]. 
When it cannot be applied, more complex rules for performing coalesced global 
memory accesses and avoiding bank conflicts for shared memory should be ap¬ 
plied. Also for CPU cores, the guideline should allow for fast vectorised memory 
accesses. If it turns out that the compiler and the hardware cannot perform such 
operations, more traditional approaches, exploiting spatial and temporal local¬ 
ity for each thread can be exploited. 

The variables with no explicit storage location form a set of automatic vari¬ 
ables, that are private to each thread. We assume that actual calculations are 
always performed using a set of registers available to a thread. If the set is large 
enough, all automatic variables can be stored in registers. This is the optimal 
situation, that can often happen for simple kernels executed on GPUs. When 
this is not the case, register spilling occurs, the values are stored in different 
levels of cache memory and eventually can be transferred, with relatively high 

they are performed by a single thread in the SIMD group. 
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Table 1: Characteristics of processors used in computational experiments. 



Processor 


Tesla 

Xeon 

Xeon 


K20m 

Phi 

E5-2620 


(GKllO) 

5110P 

(x2) 

Number of cores 

13 

60 (59) 

2x6 

Core characteristics 

Number of SIMD lanes (SP/DP) 

192/64 

16/8 

8/4 

Number of registers 

65536x32bit 

4x32x512bit 

2xl6x256bit 

Shared memory (SM) size [KB] 

16 or 48 

32 (?) 

32 (?) 

LI cache memory size [KB] 

64—SM size 

32 

32 

L2 cache memory size [KB] 

?»118 

512 

256 


cost, to global memory. 

3.5 Hardware platforms 

Optimal implementation of an algorithm has to consider specific features of 
hardware platforms for which it is designed. In the paper we try to find, for the 
particular problem of finite element numerical integration, the most important 
dependencies between the processor architectures and the performance of codes. 
We consider three types of computing devices, popular in HPC and having 
a strong presence on the current Top500 lists [38]. We do not describe the 
platforms in details, but compare several characteristics, important, in our view, 
for execution of numerical integration procedures (and definitely other scientific 
codes as well) that are further exploited when optimizations of these procedures 
are analysed. 

The first platform is a standard contemporary multi-core processor, repre¬ 
sented by Intel Xeon E5-2620. It has 6 Sandy Bridge cores, each of which sup¬ 
ports two-way multithreading. The second platform, an Intel Xeon Phi 5110P 
accelerator card has 60 cores, derived from Pentium architecture and support¬ 
ing 4-way multithreading (our OpenCL implementations can use 59 cores). The 
third platform is an NVIDIA Tesla K20m accelerator card with Kepler (GKllO) 
GPU. 

Table □ compares several important characteristics of the considered pro¬ 
cessors. The data are selected to be important for the performance oriented 
programmers and reflects purely hardware features and some details of pro¬ 
gramming environments. By the core, as it was already mentioned, we denote 
a CPU core or a compute unit (streaming multiprocessor) for GPUs. For cer¬ 
tain computing environments (as e.g. OpenCL that we use in numerical tests) 
the fact of supporting multithreading is reflected by multiplying the number of 
compute units (Intel OpenCL compiler reports the number of computing units 
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for Xeon Phi equal to 236, and for Xeon E5-2620 equal to 12). Since hyper¬ 
threading does not necessarily lead to proportional performance improvement, 
we leave in the table the number of physical cores for the Xeon processors. 

The notion of SIMD lanes, in the meaning that we use in the paper, reflects 
the ability of a hardware to perform floating point operations (single or double 
precision), as it is revealed to programmers through the speed of execution of 
suitable low level instructions. The notion of SIMD lanes is directly related to 
the notion of maximal floating point performance, expressed in TFlops, that 
can be computed by multiplying the maximal number of operations completed 
by a SIMD lane in a single cycle (usually 2 for FMA or MAD operations) by 
the number of SIMD lanes and the frequency of core operation in THz. For 
CPU cores (in Xeon and Xeon Phi processors) SIMD lanes form parts of vector 
units, for NVIDIA GPUs SIMD lanes correspond to CUDA cores and double 
precision floating point units. 

The number of registers, reveals the number reported in instruction sets 
and programming manuals. The actual performance of the codes depends on 
this number but also on the number of physical resources associated (e.g. in 
standard CPU cores, the number of physical registers is larger than the number 
available in instruction sets due to the capabilities of out-of-order execution). 

The picture of fast memory resources for contemporary processors is com¬ 
plex. Not only the details of hardware implementation (the number of banks, 
bus channels, communication protocols), but also the configurations change 
from architecture to architecture. For GPUs shared memory forms a separate 
hardware unit (but in the case of NVIDIA GPUs its size can be selected by 
the programmer). For standard CPUs, even if the programming model includes 
the notion of shared memory, it can be neglected in the actual execution (as is 
the case e.g. for Intel OpenCL compiler that, reportedly, maps it to standard 
DRAM memory and uses standard for CPU cache hierarchy mechanisms [la¬ 
the reason for placing ”?” in Table [1]). 

The practical computing power of processors is related to the number of 
SIMD lanes but there may exist important differences between the different 
types of processor cores: each one can require a different number of concur¬ 
rently executing threads (per single SIMD lane) to reach its maximal practical 
performance (this fact concerns also the performance of memory transfers). For 
standard CPU cores (as in the Xeon processor), one thread per single lane can 
suffice to reach the near peak performance. This is related to the complex ar¬ 
chitecture of cores with out-of-order execution and hardware prefetching. For 
simpler Xeon Phi cores, 4 threads are necessary to reach the peak performance. 
Fven more threads are required for NVIDIA GPUs: usually several threads are 
sufficient to hide arithmetic operation latencies, while several tens of threads 
may be needed (in the case when the threads do not issue concurrent memory 
requests) to allow for hiding the latencies of DRAM memory operations [ITIHS] . 

These facts have important consequences for designing high performance 
codes for the processors. The speed of execution strongly depends on the speed 
of providing data from different levels of memory hierarchy (registers, caches, 
DRAM). It is better to use resources that are faster, but their amount is limited 
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Table 2: Memory resources per single thread, in terms of the number of double 
precision data that can be stored, for the processors used in the study (1 thread 
per SIMD lane for Xeon CPU and 4 threads per SIMD lane for Xeon Phi and 
Kepler GPU assumed - explanation in text). 



Processor 


Tesla 

Xeon 

Xeon 


K20m 

Phi 

E5-2620 


(Kepler GKllO) 

5110P 

(x2) 

Number of DP registers 

128 

32 

16 

Number of DP entries in SM 

8 or 24 

128 (?) 

1024 (?) 

Number of DP entries in LI cache 

32—SM number 

128 

1024 

Number of DP entries in L2 cache 

R::59 

2048 

8192 


for each core. Table [2] presents the resources available to a single thread when 
the number of threads per SIMD lane is equal 1 for standard Sandy Bridge core 
in the Xeon CPU and 4 for Xeon Phi and Kepler GPU. The number of threads 
per SIMD lane is selected to some extent arbitrarily, but can be understood as 
the minimal number of threads when high performance of execution is sought. 

It can be seen that the processors exhibit different characteristics. Stan¬ 
dard Xeon cores (Sandy Bridge in this example) have relatively large caches 
with less registers. Xeon Phi cores, that are based on previous Intel GPU de¬ 
signs, still have large caches (although smaller than for Sandy Bridge), but, 
according to AVX standard, possess twice as much registers. Kepler streaming 
multiprocessors offer relatively large number of registers with relatively small 
fast memory (especially L2). If software requirements, for register per thread 
or shared memory per workgroup of threads, are large, then the GPU, either 
reduces the number of concurrently executed threads (that usually decreases 
the performance) or even cannot execute the code at all. 

All considered processors exhibit complex behaviour with respect to using 
the memory hierarchy. First generations of GPUs had not caches and, in order 
to get maximum performance, the programmers had to ensure that the number 
of requested registers did not exceed the hardware limits and that the DRAM 
memory accesses strictly adhered to the rules of coalescing, specific to differ¬ 
ent generations of GPU architectures. Current architectures, due to the use of 
caches, may not result in excessive penalties for not obeying the two mentioned 
above principles. This makes the picture of GPU performance less clear: some¬ 
times it is better to allow for register spilling (when due to the too large number 
of requested registers the data has to be stored in cache), if this decreases the 
requirements for shared memory, allowing more threads to run concurrently. 
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Performance analysis and optimization of nu¬ 
merical integration kernels 

4.1 Memory requirements 

We start the performance analysis of numerical integration kernels by specifying 
the memory requirements that can be compared with the resources offered by 
different processors. The requirements determine the possible placement of data 
in different levels of memory hierarchy and, in consequence, the number of 
accesses to different levels of memory and the number of operations performed 
during execution. 

From now on, in the rest of the paper, we concentrate on four special cases, 
that present important alternatives related to implementation of numerical inte¬ 
gration. We study two types of elements: tetrahedral (geometrically linear) and 
prismatic (geometrically multi-linear) and two types of problems (both scalar): 
with Laplace operator on the left hand side and, hence, no PDF coefficients 
corresponding to computing stiffness matrix entries and with the full set of 
coefficients c (as can happen e.g. for different varieties of convection-diffusion- 
reaction equations). For the first case, denoted later on as ’’Poisson” we consider 
a varying right hand side term, that appears in the numerical integration as a 
single separate coefficient for each integration point. For the second case, de¬ 
noted ”conv-diff” the coefficients c and d are assumed to be constant for the 
whole element. 

Table [3] presents the basic parameters for the four considered cases of nu¬ 
merical integration and the sizes (in the number of scalar values) for the arrays 
that appear in calculations. The data shows a broad range of requirements from 
several tens of scalars to more than two hundreds (these numbers can grow to 
several hundreds if e.g. non-constant PDF coefficients are considered). 

The smallest requirements are associated with SSQ versions of algorithms. 
This can lead to the shortest execution times, however, for prismatic elements 
the SSQ version is accompanied by redundant calculations of Jacobian terms. 
On the other hand, the QSS versions, that avoid this redundancy require more 
resources. The SQS versions are in between, with registers required at most for 
a single row of stiffness matrix and with Jacobian terms calculations repeated 
for each row. 

When comparing the numbers in Table |3] with the resources offered by the 
processors for a single thread, specified in Table [2l it can be seen that for 
some cases and some processors it could be possible to store all data necessary 
for numerical integration in registers. For other cases more levels of memory 
hierarchy has to be used. The actual placement of data will depend on the 
version of the algorithm and optimizations performed. 

4.2 Input and output arrays 

Because of the importance of coalescing memory accesses to DRAM memory 
for graphics processors, the organization of input and output arrays can have 
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Table 3: The number of shape functions Ns, the number of Gaussian integration 
points Nq and the associated memory requirements for the arrays appearing in 
numerical integration kernels for first order approximations, two popular types 
of finite elements: tetrahedral (tetra) and prismatic (prism) and two types of 
problems: with Laplace operator (Poisson) and with all convection-diffusion- 
reaction terms (conv-diff) 



Type of problem: 

Poisson 1 conv-diff 


tetra 

Type of 
prism 

element 

tetra 

prism 

Ns 

4 

6 

4 

6 

Nq 

4 

6 

4 

6 

Integration data (reference element) 

16 

24 

16 

24 

Input/output data for each finite element 

Geometry data (input) 

12 

18 

12 

18 

PDE coefficients (sent as input) 

4 

6 

20 

20 

Stiffness matrix A® (output) 

16 

36 

16 

36 

Load vector b® (output) 

4 

6 

4 

6 

Data for single integration point - QSS algorithm: 

PDE coefficients c and d 

1 

1 

20 

20 

Shape functions and derivatives <p 

16 

24 

16 

24 

Total (including A® and 6®) 

37 

67 

56 

86 

Data for all integration points - QSS algorithm: 

PDE coefficients c and d 

4 

6 

20 

20 

Shape functions and derivatives cj) 

28 

144 

28 

144 

Total (including A® and 6®) 

52 

192 

68 

206 

Data for all integration points - SSQ algorithm: 

PDE coefficients c and d 

4 

6 

20 

20 

Shape functions and derivatives cj) 

8 

8 

8 

8 

Total (including one entry from A® and b®) 

14 

16 

30 

30 
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significant influence on performance of numerical integration kernels. 

The arrays passed as arguments to the procedure (i.e. geometrical data and 
PDE coefficients) can be either used directly in calculations or first retrieved to 
shared memory or registers and than used many times in actual calculations. 
The first option can work well for CPU cores, for which the number of registers is 
relatively small, the compilers perform aggressive optimizations and the data is 
automatically placed in different levels of cache memory. For GPUs, the level of 
automation for achieving optimal performance of memory accesses is lower, and 
usually the programmers are responsible for ensuring the optimal organization 
of accesses. We adopt this point of view and explicitly design kernels with the 
minimal number of global memory accesses and, whenever possible, coalesced 
way of accessing DRAM by different threads. This means that we adopt only 
the second strategy, with the input data rewritten to local arrays, stored in 
registers or shared memory. In order to maintain the portability of kernels, we 
use this strategy for both CPU cores and GPUs, assuming that CPU compilers 
can perform the proper optimizations of kernels despite this additional step. 

The process of rewriting input data still can have several different forms. 
When threads rewrite subsequent data to registers, each thread refers to sub¬ 
sequent data concerning the element it currently processes (because of the one 
element-one thread parallelization strategy). In order to ensure coalesced ac¬ 
cesses for threads in a single SIMD group, the data for different elements should 
be placed in subsequent memory locations in DRAM memory, while data for 
a single element should be spread over distant memory locations. The part of 
the code responsible for passing input arguments should write properly input 
arrays, taking into account such parameters as the size of SIMD groups, the 
number of threads etc. We consider this as too far-reaching requirements and 
assume that the data is passed in a form of arrays in a ’’natural” order, i.e. the 
data for a single element are placed together. 

In this case, when rewriting data directly to registers the accesses to global 
memory are not coalesced. The typical remedy to that, used often for GPUs, is 
to employ shared memory as a temporary storage with shared access for threads 
from a single workgroup. The threads read input data in a coalesced way, with 
subsequent threads from a single workgroup referring to subsequent memory 
locations, and than write to shared memory, as is illustrated in Fig. [1] The 
order of writing to and reading from shared memory is assumed not to have 
the influence on performance and is not optimized (the order of data is left the 
same as in DRAM memory). 

The data placed in shared memory can be left there and later used in cal¬ 
culations, or can be rewritten to registers, with shared memory possibly reused 
for other data. Both options are taken into account when considering the opti¬ 
mization of kernels. 

We adopt a different strategy for output data. Since we leave produced stiff¬ 
ness matrices and load vectors for further use by other kernels or standard CPU 
code, we assume that the procedures that read them can adapt to the format 
determined by integration kernels. We also design the versions with coalesced 
and non-coalesced memory accesses, but now the versions with coalesced mem- 
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Global memory 

Element 1 data Element 2 data 

Element 3 data 


1 1 1 1 1 1 

III! 


1 1 1 


III 1 

Ill 1 

1 1 1 -i 


Ml i“ 

1 1 1 - 

Mill 

III! 


1 1 1 

.... ... 

Element 1 data Element 2 data 

Shared memory 

Element 3 data 



(a) 


Element 1 data 

Global memory 

Element 2 data 

Element 3 data 


1 1 1 1 1 

1 1 1 1 1 

1 1 1 1 1 


ii| bj bj bsl 

bj b| b| Ids| 

bj b| b| Ids| 


1 1 1 1 

III 1 

1 1 1 1 


^ 1 1 1 ^ 1 

~l 1 ^ 1 M 

^ 1 ^ 1 M 


Element 1 data - thread 1 

Element 2 data - thread 2 
Local tables 

(b) 

Element 3 data - thread 3 



Figure 1: Organization of reading input data for numerical integration: a) co¬ 
alesced reading, b) non-coalesced reading. Subsequent threads in a workgroup 
with size WS are denoted by Tj,j = 1,WS, while ik,k = 1, ..,DS denote sub¬ 
sequent iterations performed by a single thread while reading a data array of 
size DS 


ory accesses do not use additional shared memory buffers and write the data 
directly from data structures related to threads (that can be stored either in 
registers or in shared memory). As a consequence the data for single element are 
now spread over distant memory locations and the procedures that read them 
must take into account the particular formats, that depend on details of the 
organization of calculations. For non-coalesced writing, a single thread writes 
the data related to a single element to subsequent memory locations - the sit¬ 
uation that can be advantageous for CPU cores with not vectorized memory 
operations. The situations of coalesced and non-coalesced memory accesses of 
output data are illustrated in Fig. [5J 

4.3 Shared memory usage 

The algorithms presented so far can be used to create a proper code for compilers 
in different software development kits for multi-core processors. For architec¬ 
tures based on standard CPU cores, the compilers can produce codes that can 
reach high performance, thanks to different automatic optimizations. For GPUs, 
however, there exists one more important optimization, that can have substan¬ 
tial impact on performance and that has to be taken into account explicitly in 
the source code. 

This optimization is the use of shared memory, either as a mean for inter- 
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Figure 2: Organization of writing output data for numerical integration: a) co¬ 
alesced writing, b) non-coalesced writing. Subsequent threads in a workgroup 
with size WS are denoted by Tj,j = 1,WS, while ik,k = 1, ..,DS denote sub¬ 
sequent iterations performed by a single thread while writing a data array of 
size DS 
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thread communication or an explicitly manageable cache. The first option has 
been already investigated in the previous section, for the purpose of optimal 
reading of input data. The second option is related to the register requirements 
of the kernels. If this requirements are too large, the variables designed as local 
(private to each thread) are spilled to caches and DRAM memory. The last 
situation, that can happen often due to small cache resources of GPUs, can lead 
to severe performance penalties. In order to relieve register pressure some of 
data used in calculations can be placed explicitly in shared memory. Shared 
memory accesses are usually several times slower than register accesses but, in 
turn, several times faster then (even coalesced) DRAM accesses. 

One thing has to be taken into account for such designs. When one variable 
is designed to be placed in shared memory instead of a register, the kernel has 
to allocate the place in shared memory for all threads from a single workgroup. 
Storing an array with /c-entries for a single thread in shared memory, requires 
the allocation of k times the size of workgroup entries. Since workgroups sizes 
for GPUs are designed as multiples of SIMD groups sizes (with the minimum 
recommended both for AMD and NVIDIA GPUs equal to 64), moving data 
to shared memory involves relatively large resource requirements. Because of 
that, when designing numerical integration kernels, we consider the placement 
in shared memory of only one array appearing in final calculations of stiffness 
matrix and load vector entries, either the one for PDE coefficients or for shape 
functions and their derivatives or, eventually, for the output A® and 5® matrices. 

It is difficult to investigate all possibilities for placement of data in shared 
memory (including the options of reusing shared memory after coalesced reading 
for some other (than input) arrays). The number of options can reach tens (or 
even hundreds if the assumption of placing only one array in shared memory is 
abandoned). The factors that influence the performance in these cases include, 
apart from the number of accesses to shared memory, that can be induced e.g. 
from the analysis of source code, also the relative sizes and speeds of different 
levels of memory, as well as compiler strategies for allocating registers, that 
influence the number of accesses to caches and global DRAM memory. These 
factors can differ much for different architectures. 

Because of that, in order to select the best option, we choose a popular in 
HPG strategy of parameter based tuning. We create a single code with options 
concerning the placement of particular arrays in shared memory, that can be 
switched on or off at compile time, and perform an exhaustive search of the 
whole space of options by running several tens of variants of the procedure. 

4.4 Operation count 

The operations performed during kernel executions include global memory ac¬ 
cesses, shared memory accesses and the remaining operations. Among the latter 
we are interested only in floating point operations. We base our execution time 
estimates, developed always for a single finite element, on the numbers of global 
memory accesses and floating point operations, due to the difficulties with tak¬ 
ing into account shared memory and cache accesses. 
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When performing analysis at this level, the final optimizations, performed 
by software (compilers) and hardware should be taken into account. This can be 
difficult, especially when different architectures are taken into account. We try 
to develop the estimates, considering such classical optimizations as loop invari¬ 
ant code motion, common subexpression elimination, loop unrolling, induction 
variable simplification, etc. (that as we mentioned earlier are not directly in¬ 
troduced into the source code, but left for the compilers). In the estimates 
we accept the minimal number of performed operations and, when calculating 
performance data for processors, we base them on the estimated numbers. 

The number of global memory accesses is relatively easy to estimate. As was 
already mentioned, in our implementations we accept the principle of explicitly 
managing the accesses to global memory, i.e. we do not use global arrays in 
calculations but rewrite the data to shared memory or registers (local, automatic 
variables) and than use the latter in computations. Hence, the number of global 
memory accesses can be estimated based on the sizes of input and output arrays 
for the numerical integration procedures (Table[3]). All variants (QSS, SQS and 
SSQ) perform the same number of global memory accesses reported in the first 
line of Table S) 

The situation is much more complex for shared memory accesses and floating 
point operations. First, for different variants of kernels there will be different 
organization of operations. Moreover, for different organizations compilers can 
perform different sets of optimizations. We present the estimated numbers for 
shared memory accesses and floating point operations in Table|4l In deriving the 
estimates we tried to take into account the possible optimizations, related e.g. to 
the facts that certain values are constant during calculations (e.g. derivatives of 
shape functions for different integration points within the tetrahedral element, 
PDF coefficients for different integration points in the convection-diffusion case) 
and that the resulting stiffness matrices may be symmetric. 

Calculations in all versions of numerical integration algorithm contain several 
phases, such as: 

• computing real derivatives of geometrical shape functions (9 operations 
for tetrahedral elements, 126 operations per integration point for prisms) 

• computing Jacobian terms (49 operations in all cases, but performed once 
for tetrahedra and repeated for each integration point for prisms) 

• computing real derivatives for shape functions (60 operations for tetrahe¬ 
dra and 90 operations per integration point for prismatic elements) 

• calculating the entries for the stiffness matrix and the load vector 

The final calculations are performed in a triple loop over integration points 
and shape functions, but in many cases the loops are fully unrolled by the 
compiler and the resulting numbers of operations are the same for all variants 
of numerical integration. This happens for tetrahedral elements due to many 
calculations moved outside all loops (since derivatives of shape functions are 


24 


Table 4: The number of global and shared memory accesses (for shared memory 
there are considered only additional accesses not related to reading data from 
or writing data to global memory), the number of operations and the resulting 
arithmetic intensities for numerical integration kernels for first order approxi¬ 
mations, two popular types of finite elements: tetrahedral (tetra) and prismatic 
(prism) and two types of problems: with Laplace operator (Poisson) and with 
all convection-diffusion-reaction terms (conv-diff) 



Type of problem: 

Poisson 1 conv-diff 


tetra 

Type of element 
1 prism 1 tetra 

1 prism 

The number of global memory accesses: 

all variants 

36 

66 

52 

80 

The number of auxiliary shared memory accesses - QSS algorithm 

Geometry data in shared memory 

12 

108 

12 

108 

PDE coefficients c in shared memory 

4 

6 

80 

120 

Shape functions cf) in shared memory 

60 

210 

60 

210 

and in shared memory 

180 

546 

180 

546 

The number of operations 

QSS algorithm 

290 

2700 

986 

4806 

SQS algorithm 

290 

10416 

986 

12492 

SSQ algorithm 

290 

54876 

1623 

65232 

Arithmetic intensities 

QSS algorithm 

8 

41 

19 

60 

SQS algorithm 

8 

158 

19 

156 

SSQ algorithm 

8 

831 

31 

815 
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Table 5: Performance characteristics of processors used in computational exper- 
iments._ 



Processor 


Tesla 

Xeon 

Xeon 


K20m 

Phi 

E5-2620 


(GKllO) 

5110P 

(x2) 

Peak DP performance [TFlops] 

1.17 

1.01 (0.99) 

2x0.096 

Peak memory bandwidth [GB/s] 

208 

320 

2x42.6 

Limiting arithmetic intensity (DP) 

45 

25 

18 

Benchmark (DGEMM) performance 

1.10 

0.84 

0.18 

Benchmark (STREAM) bandwidth 

144 

171 

67 

Limiting arithmetic intensity (DP) 

61 

39 
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constant over the element). For prismatic elements the fact that several cal¬ 
culations must be repeated for each integration point results in much larger 
number of operations for the SQS and especially SSQ versions of the algorithm, 
as compared to the QSS version. 

Based on the estimated numbers of global memory accesses and floating 
point operations we calculate arithmetic intensities (defined as the number of 
operations performed per global memory access) for the different variants of 
integration kernels and different cases considered. The calculated values, re¬ 
ported in Table IH can be compared with performance characteristics of the 
processors used in the study. We present such characteristics in Table [H For 
each processor we report its peak performance and benchmark performance for 
the two types of operations - memory accesses and floating point operations 
(as benchmarks we consider STREAM for the memory bandwidth and dense 
matrix-matrix multiplication (DGEMM) for floating point performance; more¬ 
over from now on we restrict our analysis to only double precision calculations, 
as more versatile than single precision). Based on these data and assuming 
the simple roofline model of processor performance |40j , the limiting arithmetic 
intensity, i.e. the value of arithmetic intensity that separates the cases of the 
maximal performance limited by memory bandwidth from that limited by the 
speed of performing arithmetic operations (pipeline throughput), is calculated 
for each processor and theoretical and benchmark performance. 

Comparing the data in Table|4]and Table[5]we see that the performance when 
running different variants of numerical integration kernels can be either memory 
bandwidth or processor performance (pipeline throughput) limited. For tetra¬ 
hedral elements the performance for both example problems is memory limited 
for all architectures (excluding the not optimal SSQ version). For prismatic 
elements, even when benchmark (i.e. higher) limiting arithmetic intensity is 
considered, the performance of kernel execution should be pipeline limited for 
Xeon and Xeon Phi architectures. For the Kepler architecture and the optimal 
QSS variant the performance will be memory limited for Poisson problem, while 
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is on the border between being memory or pipeline limited for the convection- 
diffusion example. For other variants the performance is pipeline limited but 
the large number of operations in these cases makes them far from optimal. 

4.5 Execution time 

Based on the data in Tables |4] and [5] we perform final estimates of execution 
time for different variants of numerical integration and different architectures. 
We calculate separately the estimated time determined by the number of global 
memory accesses and the benchmark (STREAM) memory bandwidth of proces¬ 
sors and the time determined by the number of floating point operations and 
the arithmetic throughput in DGEMM. Table |6] contains the longer times from 
the two, computed as the estimate for each architecture. 

The values in Table [S] form lower bounds for the times possible to achieve. 
There are many factors that can slow down the execution of algorithms. The 
most important include: 

• too low occupancy for GPUs (too small number of concurrently executing 
threads in order to effectively hide instruction or memory latencies) - 
caused by excessive requirements for registers or shared memory 

• shared and cache memory accesses for GPUs, not considered in Table [6] 
and caused by either explicit use of shared memory or register spilling to 
caches (or even to global memory) 

• using other than FMA or MAD instructions - peak performance is ob¬ 
tained by all the considered processors for algorithms using almost exclu¬ 
sively FMA or MAD operations that double the performance as compared 
to using other floating point instructions 

• employing barriers to ensure the proper usage of shared memory 

4.6 Transfer from and to host memory 

The last issue that we deal with, is the one that is often considered first - the 
time required for transfer of data between host memory and accelerator memory 
(when these two memories are separate, i.e. when accelerator is connected with 
the GPU using PGIe bus, as is the case for Tesla K20m and Xeon Phi). 

The reason why we consider it last is that, based on the discussions per¬ 
formed so far, it is evident that if numerical integration is considered alone, 
without considering other procedures of finite element processing, the time re¬ 
quired for transferring input and output data (or even input data alone) strongly 
dominates the time for performing actual computations. This becomes obvious 
after realizing that the performance of numerical integration kernels is usually 
memory bound or close to being memory bound, that the amount of data in 
global memory accessed by accelerator during calculations is the same as that 
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Table 6: Estimates for the execution time of numerical integration kernels for 
first order approximations, two popular types of finite elements: tetrahedral 
(tetra) and prismatic (prism), two types of problems: with Laplace operator 
(Poisson) and with all convection-diffusion-reaction terms (conv-diff) and three 
example processors representing Xeon, Xeon Phi and Kepler architectures (times 
are reported in nanoseconds for a single element) 



Type of problem: 

Poisson 1 conv-diff 


tetra 

Type of element 
1 prism 1 tetra 

1 prism 

Tesla K20m (GKllO - Kepler) 

QSS algorithm 

2.00 

3.67 

2.89 

4.44 

SQS algorithm 

2.00 

9.47 

2.89 

11.36 

SSQ algorithm 

2.00 

49.89 

2.89 

59.30 

Xeon Phi 5110P 

QSS algorithm 

1.68 

3.21 

2.43 

5.72 

SQS algorithm 

1.68 

12.40 

2.43 

14.87 

SSQ algorithm 

1.68 

57.87 

2.43 

69.40 

Xeon E5-2620 (two socket configuration) 

QSS algorithm 

4.30 

15.00 

6.21 

26.70 

SQS algorithm 

4.30 

57.87 

6.21 

69.40 

SSQ algorithm 

4.30 

304.87 

9.02 

362.40 
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of host-accelerator memory transfer and that the bandwidth of PCIe is several 
times lower than that of either GPU or Xeon Phi global memory. 

Because of that we do not discuss this issue here. Apart from the fact that 
such a discussion should include broader context of finite element simulations, 
there is another reason for not going into details. The solution with PCIe CPU- 
accelerator connection is universally considered as one of the most important 
performance bottlenecks for accelerator computing. Hardware providers are 
considering many different options that should change this situation. When 
these solutions become available, the detailed analyses of the influence of host 
memory-accelerator memory transfer on performance of codes can be performed 
(for some initial experience see e.g. [IS]). 


5 Computational experiments 

We test the three versions (QSS, SQS, SSQ) of numerical integration kernels for 
the four described example cases (Poisson and convection-difffusion for tetra¬ 
hedral and prismatic elements) and the three selected processor architectures 
Intel Xeon E5-2620, Intel Xeon Phi 5110P and NVIDIA Tesla K20m with Ke¬ 
pler GKllO architecture. For all processors we use 64-bit Linux with 2.6.32 
kernel. For Xeon processors we use Intel SDK for OpenGL Applications version 

4.5 (with OpenGL 1.2 support). For NVIDIA GPU we use CUDA SDK version 

5.5 with OpenGL l.I support and 331.20 driver. We use the same kernels for 
all three architectures, changing only the size of workgroups: 8 for Xeon, 16 for 
Xeon Phi and 64 for Kepler. 

For each particular kernel and problem we perform a series of experiments 
with different combinations of parameters determining the use of shared memory 
in computations. In Table |7| we present the best times for each kernel, indicat¬ 
ing additionally what percentage of the best performance (associated with the 
times presented in Table H]) was achieved. Figures El HI and [5] illustrates the 
comparison of performance for each case of problem and approximation and 
different processors. 

The detailed results of experiments lead to several remarks belonging to 
different domains of interest. For practitioners performing finite element simu¬ 
lations on modern hardware the interesting facts include: 

• the execution time of finite element integration depends strongly not only 
on the order of approximation (this fact is obvious from the classical com¬ 
plexity analysis), but also on the type of element and the kind of problem 
solved. The differences are not so significant as in the case of different 
approximation orders EHlIIilll], but still can reach an order of magnitude 

• due to large differences in execution time the problem of proper mapping 
to hardware and optimization of numerical integration may be more or less 
important for the time of the whole finite element simulation, depending 
also e.g. on the time required by the solver of linear equations (if present) 

El ED) 
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Table 7: Experimental execution times and the achieved percentage of the best 
performance associated with times in Table [S] for numerical integration kernels 
for first order approximations, two popular types of finite elements: tetrahedral 
(tetra) and prismatic (prism), two types of problems: with Laplace operator 
(Poisson) and with all convection-diffusion-reaction terms (conv-diff) and three 
example processors representing Xeon, Xeon Phi and Kepler architectures (times 
are reported in nanoseconds for a single element). 



Type of problem: 

Poisson 1 conv-diff 


tetra 

Type of element: 
prism 1 tetra 

prism 

Tesla K20m (GKllO - Kepler) 

QSS algorithm 

2.02 (99%) 

12.80 (29%) 

4.46 (65%) 

15.58 (29%) 

SQS algorithm 

2.02 (99%) 

38.39 (25%) 

6.05 (48%) 

64.81 (18%) 

SSQ algorithm 

2.02 (99%) 

94.99 (53%) 

6.13 (47%) 

194.93 (30%) 

Xeon Phi 5110P 

QSS algorithm 

8.77 (19%) 

25.09 (13%) 

17.11 (14%) 

33.29 (17%) 

SQS algorithm 

11.35 (15%) 

41.20 (30%) 

15.97 (15%) 

60.18 (25%) 

SSQ algorithm 

11.43 (15%) 

101.20 (65%) 

16.53 (15%) 

132.46 (59%) 

Xeon E5-2620 (two socket configuration) 

QSS algorithm 

19.31 (22%) 

49.74 (30%) 

26.74 (23%) 

64.27 (42%) 

SQS algorithm 

18.77 (23%) 

128.07 (45%) 

24.86 (25%) 

133.14 (52%) 

SSQ algorithm 

17.76 (24%) 

385.33 (79%) 

26.11 (35%) 

475.41 (76%) 


30 

































QSS QSS SSQ QSS QSS QSS QSS SQS SQS QSS QSS QSS 

Poisson - Tetra Poisson - Prism Conv-Diff - Tetra Conv-Diff - Prism 


Figure 3: Perfromance/Execution times of numerical integration for a single 
tetrahedral and prismatic element, Poisson and convection-diffusion problem 
and the best version among integration kernels for each of tested architectures: 
Xeon, Xeon Phi and Kepler (Tesla K20m) 


For finite element software developers, especially in the HPC domain, there 
are the following general observations: 

• large resource requirements for some variants of numerical integration (e.g. 
convection-diffusion problems and prismatic elements), limits the perfor¬ 
mance achieved by the GPU architecture in our comparison; the reason 
lies mainly in too small number of threads (SIMD groups) that can run 
concurrently on a single core (streaming multiprocessor) 

• the QSS variants of the numerical integration algorithm turned out to 
be the most versatile and efficient; due to the proper optimizations in 
the source code and performed by the compilers, the performance of QSS 
kernels was either the fastest or slower by only several percent than the 
fastest version 

• detailed results of experiments showed that the performance on our GPU 
(as is also the case for other contemporary GPU architectures) is sensitive 
to the way the data are read and written to memory, coalescing memory 
accesses turned out to be more important for the Kepler processor than 
for both Xeons 

• when the performance of algorithm is memory bound and there is a signifi¬ 
cant influence of the way the data are accessed in global memory (as is the 
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QSS QSS SSQ QSS SQS SQS QSS QSS QSS QSS QSS QSS 


Poisson - Tetra Conv-Diff - Tetra Poisson - Prism Conv-Diff - Prism 

(a) (b) 

Figure 4: Absolute perfromance of numerical integration for Poisson and 
convection-diffusion problem and the best version among integration kernels 
for each of tested architectures: Xeon, Xeon Phi and Kepler (Tesla K20m): a) 
in GB/s for a single tetrahedral element, b) in GFLOPS for a single prismatic 
element, 
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Figure 5: Relative performance (as percentage of the maximal benchmark per¬ 
formance) of numerical integration for a single tetrahedral and prismatic el¬ 
ement, Poisson and convection-diffusion problem and the best version among 
integration kernels for each of tested architectures: Xeon, Xeon Phi and Kepler 
(Tesla K20m). For tetrahedral elements the percentage refers to the STREAM 
memory bandwidth benchmark, for prismatic elements it refers to the DGEMM 
floating point performance (except for Kepler architecture where it is still related 
to the STREAM benchmark). 
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case e.g. for Kepler architecture, tetrahedral element and Poisson prob¬ 
lem), there is a growing importance of data layout in memory (and the 
use of output data by other components of finite element codes) [710 [H] 

• the GPU architecture turned out to be also more sensitive to the use 
of OpenCL shared memory (as suggested by the Intel OpenCL compiler 
manual HZ], this had no significant impact on the performance of Xeons) 

• together, the changes in the way data are read and written to the global 
memory and the changes in shared memory usage caused the differences 
in execution time of the order 2-3 for Xeons, and more than 5 for Kepler 

In general, for some test cases, the performance of obtained portable numeri¬ 
cal integration kernels can be considered satisfactory. Still, for each architecture 
and most of the cases considered, the benchmark results suggest that it should 
be possible to achieve execution times several times shorter (from 3 times for 
Kepler architecture, up to almost 8 times for Xeon Phi). This could however usu¬ 
ally be obtained by either considering the specialized domain specific optimizing 
compilers or by loosing the portable character of the kernels and introducing 
architecture specific optimizations into the source code. Another possibility is 
to consider different parallelization strategies, the option that we plan to follow 
in subsequent publications. 

Finally there are some observations concerning the hardware itself (these 
observations are common to many algorithms, but are confirmed for our example 
case of finite element numerical integration): 

• our example GPU, Tesla K20m, turned out to be the fastest processor 
in the comparison; the observations made by several authors [521 HU 155] 
show that Xeon Phi processors applied to different scientific computing 
algorithms, although usually faster than standard Xeon GPUs, are usually 
slower than the recent HPC targeted GPUs 

• the detailed results of experiments for particular optimization variants of 
kernels showed significant resemblance of Xeon and Xeon Phi performance, 
indicating the similarities between these architectures. 

6 Related work 

The subject of numerical integration on modern multi-core processors, espe¬ 
cially graphics processors, has been investigated in several contexts. The first 
works concerned GPU implementations for specific types of problems, such as 
higher order FEM approximations in earthquake modelling and wave propa¬ 
gation problems [23] . GPU implementations of some variants of discontinuous 
Galerkin approximation m or higher order approximations for electromagnet¬ 
ics problems jS]. 

The second important context for which finite element numerical integra¬ 
tion was considered, is the creation of domain specific languages and compilers. 
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Interesting works for this approach include [301 ini m nzi HH]- The research 
in the current paper can be understood as laying theoretical and experimental 
grounds for possible future finite element compilers, able to create optimized 
code for all types of problems, elements and approximations, as well as different 
processor architectures. 

Finally, numerical integration, in more or less details, have been discussed in 
the context of the whole simulation procedure by the finite element codes. Some 
works in this area include m [31 m [Ml m [lu I3S] . Usually more attention 
to numerical integration has been given in articles that consider the process of 
creation of systems of linear equations for finite element approximations, with 
the important examples such as [ini[3Zl[l[H]- 

The present paper forms a continuation of our earlier works, devoted solely to 
the subject of finite element numerical integration. The first papers considered 
higher order approximations, starting with first theoretical and experimental 
investigations in |341124] and culminating in larger articles devoted to the im¬ 
plementation and performance of numerical integration kernels for multi-core 
processors with vector capabilities [M] (especially IBM Power XCell processor) 
and GPUs g]. 

The investigations in the current paper on one hand touch the narrower 
subject of only first order approximations, but on the other hand concern two 
important types of finite elements (geometrically linear and non-linear) and are 
performed, in a unifying way, for three different processor architectures used 
in scientific computing. This scope, combined with the depth of investigations 
concerning the performance of OpenCL kernels, differentiate the article from 
the other on this subject. 

7 Conclusions 

We have investigated the optimization and performance on current multi- and 
many-core processors for an important computational kernel in scientific com¬ 
puting, the finite element numerical integration algorithm for first order ap¬ 
proximation. We have used a unifying approach of OpenCL for programming 
model and implementation on three popular architectures in scientific comput¬ 
ing: Intel Xeon (Sandy Bridge), Intel Xeon Phi and NVIDIA Kepler. We have 
demonstrated that this approach allows for exploiting multi-core and vector 
capabilities of processors and for obtaining satisfactory levels of performance, 
as compared to the theoretical and benchmark maxima, not loosing the porta¬ 
bility of the developed code (we have used the same OpenCL kernels for all 
architectures). The investigations in the paper reveal that, even for the simple 
problems, elements and approximations as considered in the paper, there is a 
significant variation of required resources and associated complexities for the 
algorithm, that can lead to different problems when mapping to architectures 
of modern processors. Nevertheless, the results of computational experiments 
show that for all the cases considered in the paper, the numerical integration 
algorithm can be successfully ported to massively multi-core architectures, and 
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hence, when used in finite element codes should not form a performance bot¬ 
tleneck. The presented detailed analyses indicate what conditions must be met 
in order to obtain the best performance of the kernels and what performance 
can be expected when numerical integration is used on different processors for 
different types of problems and finite element approximations. 
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